37 research outputs found

    Action recognition based on efficient deep feature learning in the spatio-temporal domain

    Hand-crafted feature functions are usually designed based on the domain knowledge of a presumably controlled environment and often fail to generalize, as the statistics of real-world data cannot always be modeled correctly. Data-driven feature learning methods, on the other hand, have emerged as an alternative that often generalizes better in uncontrolled environments. We present a simple yet robust 2D convolutional neural network, extended to a concatenated 3D network, that learns to extract features from the spatio-temporal domain of raw video data. The resulting network model is used for content-based recognition of videos. Relying on a 2D convolutional neural network allows us to exploit a pretrained network as a descriptor that yielded the best results on the large and challenging ILSVRC-2014 dataset. Experimental results on commonly used benchmark video datasets demonstrate that our results are state-of-the-art in terms of accuracy and computational time without requiring any preprocessing (e.g., optic flow) or a priori knowledge of data capture (e.g., camera motion estimation), which makes our approach more general and flexible than others. Our implementation is made available.
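
    As an illustrative sketch only (not the authors' exact architecture), the idea of combining a pretrained 2D network with a small 3D convolutional head over stacked per-frame features can be written in a few lines of PyTorch. The backbone choice (ResNet-18), layer sizes, and number of action classes below are assumptions for demonstration.

        import torch
        import torch.nn as nn
        from torchvision import models

        class SpatioTemporalNet(nn.Module):
            """Per-frame 2D features from a pretrained backbone, fused by a 3D convolution over time."""
            def __init__(self, num_actions=10):                       # num_actions is a placeholder
                super().__init__()
                backbone = models.resnet18(weights="IMAGENET1K_V1")   # any pretrained 2D descriptor works
                self.encoder = nn.Sequential(*list(backbone.children())[:-2])
                self.temporal = nn.Conv3d(512, 256, kernel_size=3, padding=1)
                self.classifier = nn.Linear(256, num_actions)

            def forward(self, clip):                                  # clip: (batch, time, 3, H, W)
                b, t, c, h, w = clip.shape
                f = self.encoder(clip.reshape(b * t, c, h, w))        # (b*t, 512, h', w')
                f = f.reshape(b, t, *f.shape[1:]).permute(0, 2, 1, 3, 4)  # (b, 512, t, h', w')
                f = torch.relu(self.temporal(f))
                return self.classifier(f.mean(dim=(2, 3, 4)))         # pool over time and space

        # Toy usage: logits = SpatioTemporalNet()(torch.randn(2, 8, 3, 224, 224))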

    Consistent depth video segmentation using adaptive surface models

    We propose a new approach for the segmentation of 3-D point clouds into geometric surfaces using adaptive surface models. Starting from an initial configuration, the algorithm converges to a stable segmentation through a new iterative split-and-merge procedure, which includes an adaptive mechanism for the creation and removal of segments. This allows the segmentation to adjust to changing input data over the course of the video, leading to stable, temporally coherent, and traceable segments. We tested the method on a large variety of data acquired with different range imaging devices, including a structured-light sensor and a time-of-flight camera, and successfully segmented the videos into surface segments. We further demonstrated the feasibility of the approach using quantitative evaluations based on ground-truth data. This research is partially funded by the EU project IntellAct (FP7-269959), the Grup consolidat 2009 SGR155, the project PAU+ (DPI2011-27510), and the CSIC project CINNOVA (201150E088). B. Dellen acknowledges support from the Spanish Ministry of Science and Innovation through a Ramon y Cajal program.
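
    As a minimal sketch of the kind of per-segment surface model involved (our own simplified illustration, not the paper's implementation), a quadratic depth model can be fitted to a segment's points by least squares; the residuals then give the fit error on which a split-and-merge criterion could act. All names and values below are hypothetical.

        import numpy as np

        def fit_quadratic_surface(points):
            """Fit z = a*x^2 + b*y^2 + c*x*y + d*x + e*y + f to one segment's (N, 3) points."""
            x, y, z = points[:, 0], points[:, 1], points[:, 2]
            A = np.column_stack([x**2, y**2, x * y, x, y, np.ones_like(x)])
            coeffs, *_ = np.linalg.lstsq(A, z, rcond=None)
            residuals = z - A @ coeffs          # per-point error, usable as a split/merge criterion
            return coeffs, residuals

        # Toy usage on a noisy paraboloid patch
        xy = np.random.rand(500, 2)
        z = 0.5 * xy[:, 0]**2 - 0.2 * xy[:, 1]**2 + 0.01 * np.random.randn(500)
        coeffs, res = fit_quadratic_surface(np.column_stack([xy, z]))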

    Recognizing point clouds using conditional random fields

    Presented at the 22nd International Conference on Pattern Recognition (ICPR-2014), held in Stockholm, Sweden, 24–28 August. Detecting objects in cluttered scenes is a necessary step for many robotic tasks and facilitates the interaction of the robot with its environment. Because of the availability of efficient 3D sensing devices such as the Kinect, methods for the recognition of objects in 3D point clouds have gained importance in recent years. In this paper, we propose a new supervised learning approach for the recognition of objects from 3D point clouds using Conditional Random Fields, a type of discriminative, undirected probabilistic graphical model. The various features and contextual relations of the objects are described by the potential functions in the graph. Our method allows for learning and inference from unorganized point clouds of arbitrary sizes and shows significant benefit in terms of computational speed during prediction when compared to a state-of-the-art approach based on constrained optimization. This work was supported by the EU project IntellAct (FP7-269959), the project PAU+ (DPI2011-27510), and the CSIC project CINNOVA (201150E088). B. Dellen was supported by the Spanish Ministry for Science and Innovation via a Ramon y Cajal fellowship.
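
    To make the role of the potential functions concrete, here is a generic, hedged sketch of a CRF energy over a segment graph and a simple greedy (ICM-style) labeling pass. The paper's feature design, learned weights, and inference procedure are not reproduced; all names below are placeholders.

        import numpy as np

        def crf_energy(labels, unary, pairwise, edges):
            """
            labels:   (N,) label index per node (e.g., per point-cloud segment)
            unary:    (N, L) cost of assigning each of L labels to each node
            pairwise: (L, L) cost for label pairs on neighboring nodes (contextual relations)
            edges:    list of (i, j) node pairs that are contextually related
            """
            e = sum(unary[i, labels[i]] for i in range(len(labels)))
            e += sum(pairwise[labels[i], labels[j]] for i, j in edges)
            return e

        def icm(unary, pairwise, edges, iters=10):
            """Greedy inference: repeatedly relabel each node to lower the total energy."""
            labels = unary.argmin(axis=1)
            for _ in range(iters):
                for i in range(len(labels)):
                    costs = unary[i].copy()
                    for a, b in edges:
                        if a == i:
                            costs += pairwise[:, labels[b]]
                        elif b == i:
                            costs += pairwise[labels[a], :]
                    labels[i] = costs.argmin()
            return labels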

    Perceiving dynamic environments: from surface geometry to semantic representation

    Perceiving human environments is becoming increasingly fundamental with the gradual adaptation of robots for domestic use. High-level tasks such as the recognition of objects and actions need to be performed for the active engagement of the robot with its surroundings. Nowadays, the environment is primarily captured using visual information in the form of color and depth images. Visual cues obtained from these images serve as a base upon which perception-related applications are developed, for example, using appearance models for detecting objects and extracting motion information for recognizing actions. However, given the complex variations of naturally occurring scenes, extracting a set of robust visual cues becomes harder here than in other contexts. In this thesis, we develop a hierarchy of tools to improve the different aspects of robot perception in human-centered, possibly dynamic, environments. We start with the segmentation of single images and extend it to videos. Afterwards, we develop a surface tracking approach that incorporates our video segmentation method. We then investigate the higher-level tasks of semantic segmentation and recognition. Finally, we focus on recognizing actions in videos. The introduction of Kinect-style depth sensors is relatively recent, and their usage in the field of robotics only dates back about half a decade. Such sensors enable the acquisition of high-resolution color and depth images at a low cost. Given this opportunity, we dedicate a bulk of our work to the exploitation of the depth information obtained using such sensors, thereby pushing forward the state of the art in perception problems.

    The thesis is conceptually grouped into two parts. In the first part, we address the low-level tasks of segmentation and tracking with depth images. In many cases, depth data gives a better disambiguation of surface boundaries of different objects in a scene when compared to their color counterpart. We exploit this information in a novel depth segmentation scheme that fits quadratic surface models to the different surfaces in a competing fashion. We further extend the method to the video domain by initializing the segmentation results and surface model parameters of the next frame from the previous frame. In this way, we create a video segmentation algorithm in which the segment label belonging to each surface becomes coherent over time. We also devise a particle-filter-based tracker that uses depth data to track a surface. The tracker is made more robust by combining it with our video segmentation approach. The segmentation results serve as a useful prior for high-level tasks. In the second part we deal with such tasks, which include (i) object recognition, (ii) pixelwise object class segmentation, and (iii) action recognition. We propose (i) to address object recognition by creating context-aware conditional random field models. We show the importance of context in object recognition by modeling geometrical relations between different objects in a scene. We perform (ii) object class segmentation using a convolutional neural network. We introduce a novel distance-from-wall feature and demonstrate its effectiveness in generating better class proposals for objects that are close to the walls. The final part of the thesis deals with (iii) action recognition. We propose a 2D convolutional neural network extended to a concatenated 3D network that learns to extract features from the spatio-temporal domain of raw video data. The network is trained to predict an action label for each video. In summary, several perception aspects are addressed with the utilization of depth information where available. Our main contributions are (a) the introduction of a depth video segmentation scheme, (b) a graphical model for object recognition, and (c, d) deep learning models for object class segmentation and action recognition.

    Joint segmentation and tracking of object surfaces in depth movies along human/robot manipulations

    A novel framework for joint segmentation and tracking of object surfaces in depth videos is presented. Initially, the 3D colored point cloud obtained using the Kinect camera is used to segment the scene into surface patches, defined by quadratic functions. The computed segments together with their functional descriptions are then used to partition the depth image of the subsequent frame in a manner consistent with the preceding frame. This way, solutions established in previous frames can be reused, which improves the efficiency of the algorithm and the coherency of the segmentations throughout the video. The algorithm is tested on scenes showing human and robot manipulations of objects. We demonstrate that the method can successfully segment and track the human/robot arm and object surfaces along the manipulations. The performance is evaluated quantitatively by measuring the temporal coherency of the segmentations and the segmentation covering using ground truth. The method provides a visual front-end designed for robotic applications, and can potentially be used in the context of manipulation recognition, visual servoing, and robot-grasping tasks.
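
    A hedged sketch of the reuse idea follows: each pixel of the new depth frame is assigned to whichever previous-frame surface model explains its depth best, which is one simple way to carry segment labels across frames. The quadratic model matches the fitting sketch shown earlier, and the residual threshold is an assumed placeholder, not a value from the paper.

        import numpy as np

        def predict_depth(coeffs, xs, ys):
            """Evaluate the quadratic surface model z = a*x^2 + b*y^2 + c*x*y + d*x + e*y + f."""
            A = np.column_stack([xs**2, ys**2, xs * ys, xs, ys, np.ones_like(xs)])
            return A @ coeffs

        def propagate_labels(depth, models, max_residual=0.02):
            """depth: (H, W) depth image of the new frame; models: {segment_id: coeffs} from the previous frame."""
            H, W = depth.shape
            ys, xs = np.mgrid[0:H, 0:W]
            xs, ys, zs = xs.ravel().astype(float), ys.ravel().astype(float), depth.ravel()
            best = np.full(zs.shape, np.inf)
            labels = np.full(zs.shape, -1, dtype=int)        # -1 marks pixels no previous model explains
            for seg_id, coeffs in models.items():
                res = np.abs(zs - predict_depth(coeffs, xs, ys))
                take = (res < best) & (res < max_residual)
                labels[take] = seg_id
                best = np.minimum(best, res)
            return labels.reshape(H, W)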

    Combining semantic and geometric features for object class segmentation of indoor scenes

    Scene understanding is a necessary prerequisite for robots acting autonomously in complex environments. Low-cost RGB-D cameras such as the Microsoft Kinect enabled new methods for analyzing indoor scenes and are now ubiquitously used in indoor robotics. We investigate strategies for efficient pixelwise object class labeling of indoor scenes that combine both pretrained semantic features, transferred from a large color image dataset, and geometric features computed relative to the room structures, including a novel distance-from-wall feature, which encodes the proximity of scene points to a detected major wall of the room. We evaluate our approach on the popular NYU v2 dataset. Several deep learning models are tested, which are designed to exploit different characteristics of the data; this includes feature learning with two different pooling sizes. Our results indicate that combining semantic and geometric features yields significantly improved results for the task of object class segmentation. This research is partially funded by the CSIC project MANIPlus (201350E102) and the project RobInstruct (TIN2014-58178-R).
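
    As an illustration of what a distance-from-wall style feature can look like (a simplified variant, not the paper's exact pipeline), a dominant plane can be fitted to the scene points with a small RANSAC loop and the point-to-plane distance used as an extra per-point feature channel. A real implementation would additionally constrain the plane normal so that the detected plane is a wall rather than the floor; all parameters below are assumptions.

        import numpy as np

        def fit_plane_ransac(points, iters=200, inlier_thresh=0.02, rng=None):
            """Fit a dominant plane (normal, offset) to (N, 3) points by random sampling."""
            if rng is None:
                rng = np.random.default_rng(0)
            best_plane, best_inliers = None, 0
            for _ in range(iters):
                p = points[rng.choice(len(points), 3, replace=False)]
                normal = np.cross(p[1] - p[0], p[2] - p[0])
                norm = np.linalg.norm(normal)
                if norm < 1e-9:
                    continue                                  # degenerate (collinear) sample
                normal /= norm
                offset = -normal @ p[0]
                inliers = (np.abs(points @ normal + offset) < inlier_thresh).sum()
                if inliers > best_inliers:
                    best_plane, best_inliers = (normal, offset), inliers
            return best_plane

        def distance_from_wall(points):
            """One scalar feature per 3D point: absolute distance to the detected dominant plane."""
            normal, offset = fit_plane_ransac(points)
            return np.abs(points @ normal + offset)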

    Realtime tracking and grasping of a moving object from range video

    Presented at ICRA 2014, held in Hong Kong, 31 May to 7 June. In this paper we present an automated system that is able to track and grasp a moving object within the workspace of a manipulator using range images acquired with a Microsoft Kinect sensor. Realtime tracking is achieved by a geometric particle filter on the affine group. Based on the tracked output, the pose of a 7-DoF WAM robotic arm is continuously updated using dynamic motor primitives until a distance measure between the tracked object and the gripper mounted on the arm falls below a threshold; the gripper then closes its three fingers and grasps the object. The tracker works in real time and is robust to noise and partial occlusions. Using only the depth data makes our tracker independent of texture, which is one of the key design goals of our approach. An experimental evaluation is provided along with a comparison of the proposed tracker with state-of-the-art approaches, including the OpenNI tracker. The developed system is integrated with ROS and made available as part of IRI's ROS stack. This work was supported by the EU project IntellAct (FP7-269959), the project PAU+ (DPI2011-27510), and the project CINNOVA (201150E088). B. Dellen was supported by the Spanish Ministry for Science and Innovation via a Ramon y Cajal fellowship.
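
    The predict/weight/resample loop behind such a particle-filter tracker, together with the distance-threshold grasp trigger, can be sketched as below. This is a deliberately simplified, hedged illustration: the state here is only a 3D position with a synthetic Gaussian observation model, whereas the paper's tracker operates on the full affine group and weights particles against the depth image; all numbers are placeholders.

        import numpy as np

        rng = np.random.default_rng(0)

        def particle_filter_step(particles, observation, motion_noise=0.01, obs_noise=0.02):
            particles = particles + rng.normal(0, motion_noise, particles.shape)   # predict (random walk)
            err = np.linalg.norm(particles - observation, axis=1)
            weights = np.exp(-0.5 * (err / obs_noise) ** 2)                        # weight by likelihood
            weights /= weights.sum()
            idx = rng.choice(len(particles), len(particles), p=weights)            # resample
            return particles[idx]

        # Toy usage: track an object drifting along x and trigger a grasp near a fixed gripper pose.
        particles = rng.normal(0.0, 0.05, (500, 3))
        gripper_position = np.array([0.5, 0.0, 0.0])                  # placeholder gripper location
        for t in range(120):
            observed = np.array([0.005 * t, 0.0, 0.0])                # simulated moving object
            particles = particle_filter_step(particles, observed)
            estimate = particles.mean(axis=0)
            if np.linalg.norm(estimate - gripper_position) < 0.02:    # distance threshold from the abstract
                print(f"grasp triggered at step {t}")
                break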
